Coordinate Descent on the Orthogonal Group for Recurrent Neural Network Training
Authors
Abstract
We address the poor scalability of learning algorithms for orthogonal recurrent neural networks via the use of stochastic coordinate descent on the orthogonal group, leading to a cost per iteration that increases linearly with the number of recurrent states. This contrasts with the cubic dependency of typical feasible algorithms, such as stochastic Riemannian gradient descent, which prohibits the use of big network architectures. Coordinate descent rotates successively two columns of the recurrent matrix. When the coordinate (i.e., the indices of the rotated columns) is selected uniformly at random at each iteration, we prove convergence of the algorithm under standard assumptions on the loss function, stepsize, and minibatch noise. In addition, we numerically show that the Riemannian gradient has an approximately sparse structure. Leveraging this observation, we propose a variant of the algorithm that relies on the Gauss-Southwell selection rule. Experiments on a benchmark recurrent neural network training problem show that the proposed approach is a very promising step towards the scalable training of orthogonal recurrent neural networks.
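The abstract describes the core update concretely enough to sketch: each iteration rotates two columns of the orthogonal recurrent matrix by a Givens rotation, with the column pair chosen either uniformly at random or by a Gauss-Southwell rule. The NumPy sketch below illustrates one such step under stated assumptions; the function names, the skew-symmetric projection A = W^T G - G^T W, the sign conventions, and the angle update theta = -stepsize * A[i, j] are illustrative choices of this sketch, not the paper's exact formulation.

import numpy as np

def givens_rotate_columns(W, i, j, theta):
    # Rotate columns i and j of W by angle theta; O(n) work and W stays orthogonal.
    c, s = np.cos(theta), np.sin(theta)
    wi, wj = W[:, i].copy(), W[:, j].copy()
    W[:, i] = c * wi - s * wj
    W[:, j] = s * wi + c * wj
    return W

def coordinate_step(W, G, stepsize, rule="uniform", rng=None):
    # One stochastic coordinate-descent step on the orthogonal group (sketch).
    # W: (n, n) orthogonal recurrent matrix.
    # G: (n, n) Euclidean minibatch gradient of the loss w.r.t. W
    #    (e.g. obtained by backpropagation through the RNN).
    # With the column-rotation convention above, the derivative of the loss
    # w.r.t. the rotation angle at theta = 0 is the (i, j) entry of the
    # skew-symmetric matrix A = W^T G - G^T W (an assumption of this sketch).
    rng = rng or np.random.default_rng()
    n = W.shape[0]
    if rule == "uniform":
        # Uniform random coordinate: only two O(n) inner products are needed.
        i, j = rng.choice(n, size=2, replace=False)
        a_ij = W[:, i] @ G[:, j] - G[:, i] @ W[:, j]
    else:
        # Gauss-Southwell: pick the largest-magnitude component of A.
        # A is formed densely here for clarity; exploiting its approximately
        # sparse structure, as the abstract suggests, would avoid this cost.
        A = W.T @ G - G.T @ W
        iu, ju = np.triu_indices(n, k=1)
        k = int(np.argmax(np.abs(A[iu, ju])))
        i, j, a_ij = iu[k], ju[k], A[iu[k], ju[k]]
    theta = -stepsize * a_ij  # move against the gradient component
    return givens_rotate_columns(W, i, j, theta)

A step of this kind keeps W exactly orthogonal at O(n) update cost, whereas a full Riemannian gradient step would require an O(n^3) operation such as a matrix exponential or QR-based retraction.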
Similar resources
Recurrent neural network training with preconditioned stochastic gradient descent
Recurrent neural networks (RNN), especially the ones requiring extremely long-term memories, are difficult to train. Hence, they provide an ideal testbed for benchmarking the performance of optimization algorithms. This paper reports test results of a recently proposed preconditioned stochastic gradient descent (PSGD) algorithm on RNN training. We find that PSGD may outperform Hessian-free o...
On descent spectral CG algorithm for training recurrent neural networks
In this paper, we evaluate the performance of a new class of conjugate gradient methods for training recurrent neural networks which ensure the sufficient descent property. The presented methods preserve the advantages of classical conjugate gradient methods and simultaneously avoid the usually inefficient restarts. Simulation results are also presented using three different recurrent neural ne...
Accelerating Recurrent Neural Network Training
An efficient algorithm for recurrent neural network training is presented. The approach increases the training speed for tasks where a length of the input sequence may vary significantly. The proposed approach is based on the optimal batch bucketing by input sequence length and data parallelization on multiple graphical processing units. The baseline training performance without sequence bucket...
Efficient coordinate-descent for orthogonal matrices through Givens rotations
Optimizing over the set of orthogonal matrices is a central component in problems like sparse-PCA or tensor decomposition. Unfortunately, such optimization is hard since simple operations on orthogonal matrices easily break orthogonality, and correcting orthogonality usually costs a large amount of computation. Here we propose a framework for optimizing orthogonal matrices, that is the parallel...
Coordinate-descent for learning orthogonal matrices through Givens rotations
Optimizing over the set of orthogonal matrices is a central component in problems like sparse-PCA or tensor decomposition. Unfortunately, such optimization is hard since simple operations on orthogonal matrices easily break orthogonality, and correcting orthogonality usually costs a large amount of computation. Here we propose a framework for optimizing orthogonal matrices, that is the parallel ...
Journal
Journal title: Proceedings of the ... AAAI Conference on Artificial Intelligence
Year: 2022
ISSN: 2159-5399, 2374-3468
DOI: https://doi.org/10.1609/aaai.v36i7.20742